ABSTRACT

The communities of a social network are sets of vertices with 
more connections inside the set than outside. We theoreti- 
cally demonstrate that two commonly observed properties of 
social networks, heavy-tailed degree distributions and large 
clustering coefficients, imply the existence of vertex neigh- 
borhoods (also known as egonets) that are themselves good 
communities. We evaluate these neighborhood communities 
on a range of graphs. We find that neighborhood communities
often exhibit conductance scores that are as good as that of
the Fiedler cut. Moreover, the conductance of neighborhood
communities behaves similarly to the network community
profile computed with a personalized PageRank community
detection method. The latter requires sweeping
over a great many starting vertices, which can be expen- 
sive. By using a small and easy-to-compute set of neigh- 
borhood communities as seeds for these PageRank commu- 
nities, however, we find communities that precisely capture 
the behavior of the network community profile when seeded 
everywhere in the graph, and at a significant reduction in 
total work.

1. INTRODUCTION

Community detection, loosely speaking, is any process 
that takes a graph or network and picks out sets of related
nodes. An incredible variety of techniques exist for
this single task, which has a variety of names as well:
community detection, graph clustering, and graph partitioning.
Throughout this manuscript, we shall use the terms
community and cluster interchangeably. For more information
about approaches for this problem, see the recent survey by
Schaeffer [34]. In many techniques, a community is defined
as a set with a good score under a quality measure that re- 
flects the connectivity between the set and the rest of the 
network. Common measures are based on density of local 
edges, deviance from a random null model, the behavior of 
random walks, or graph cuts. Most of these measures are
NP-hard to optimize.

To keep this manuscript simple, we shall evaluate communities
using their conductance score. Schaeffer identified
this measure as one of the most important cut-based
measures, and it has been studied extensively in a variety
of disciplines [11,17,36]. Work by Leskovec et al. has recently
demonstrated that, although different quality measures
produce differences in terms of specific communities,
strong communities persist under a variety of measures [26].

A vertex neighborhood of a vertex v is the set of vertices
directly connected to v via an edge and v itself. For example,
see the green and black vertices at right. What we show here
is that the presence of two commonly observed properties of
modern information networks - a large global clustering
coefficient [39] and a power-law degree distribution [5] - implies
the existence of vertex neighborhoods with good conductance 
scores. We make this statement precise in Theorem 4.6. 
These results can be seen as an extension of the simple ob- 
servation that, in the extreme case when the global cluster- 
ing coefficient of a network is 1, then the network must be a 
union of cliques. Neighborhoods define ideal communities in 
this case. We mathematically show that this argument can 
be extended to the case when the graph has a power-law 
degree distribution and a large clustering coefficient. The
significance of this finding is that robust community detection
need not employ complicated algorithms. Instead,
a straightforward approach that just involves counting triangles -
a function that is easy to implement in MapReduce [12]
and easy to approximate [21] - suffices to identify
communities. It is intriguing that arguably the two most
important measurable quantities of social networks imply
that communities are very easy to find. This may lead to
more mathematical work explaining the success of community
detection algorithms, given that the problems are in general
NP-hard. We note that, unfortunately, our theoretical
bounds reflect worst-case behavior and are weaker than
required for practical use. Consequently, in the remainder
of the paper we explore the utility of neighborhood commu- 
nities empirically. 

Section 2. The technical discussion of the manuscript begins
by introducing our notation and precisely defining the
quantities we examine, such as clustering coefficients, because
the definitions of these measures vary. We also discuss
the Andersen-Chung-Lang personalized PageRank clustering
scheme [2] and the network whiskers from Leskovec et
al. [24,25]. We utilize the latter two algorithms as reference
points for the success of our community detection.

Section 3. We discuss some of the other observed prop- 
erties of egonets, or vertex neighborhoods, along with other 
related work including overlapping communities. 

Section 4. We state and prove the theoretical results 
that graphs with a power-law degree distribution and large 
clustering coefficients have neighborhood communities with 
good conductance scores. 

Section 5. We review the data that will serve as the 
testbed for our empirical evaluation of neighborhood cuts. 
This comes from a variety of public sources and spans collab- 
oration networks, social networks, technological networks, 
web networks, and random graph models. 

Section 6. Our empirical investigation of neighborhood 
clusters takes the following form. We first exhibit the con- 
ductance scores for the set of neighborhood communities for 
a few graphs (e.g. Figure 2). We find that neighborhood 
communities reflect the shape of the network community 
plot observed by Leskovec et al. [24, 25] at small size scales. 
We next compare the best neighborhood communities to 
those discovered by four other procedures: the Fiedler com- 
munity, the best personalized PageRank community (§2.3), 
the best network whisker (§2.3), and the best clusters from 
METIS [18]. In one third of the cases, the neighborhood com- 
munity is as good as the best of any of the other algorithms. 

Another outcome of the theory from §4 is that large cores
must exist in these graphs. (Here, a graph k-core is a subset
of vertices where all nodes have degree at least k [35].) We
conclude this section by exploring the community properties
of the graph k-cores.

Section 7. Motivated by the success of the neighborhood 
communities at small size scales, we explore using the best 
vertex neighborhoods as seeds for a local greedy community 
expansion procedure and for the Andersen-Chung-Lang al- 
gorithm. Here, we find that these procedures, when seeded 
with an easy-to-identify set of neighborhood communities, 
produce larger clusters that decay as expected by the results
in Leskovec et al. [24, 25].

We make all of our algorithm and experimental code, the
majority of the data for the experiments, and some extra
figures that did not fit into the paper available:

www.cs.purdue.edu/homes/dgleich/codes/neighborhoods

These codes are easy to use. Given the adjacency matrix of
a network A, the single command

>> ncpneighs(A)

will produce a figure analyzing the neighborhood communities
in comparison to the Fiedler community (formal definition
in Section 2.3).

Table 1: A summary of the notation.

$n = |V|$            the number of vertices
$m = |E|$            the number of edges
$d_v$                the degree of vertex v
$f_d$                the number of vertices of degree d
$W$                  the set of wedges in a graph
$W_v$                the set of wedges centered at vertex v
$\kappa$             the global clustering coefficient
$\bar{C}$            the mean local clustering coefficient
$C_v$                the local clustering coefficient for vertex v
$N_r(v)$             the set of vertices within distance r of v
$E(S, T)$            the set of edges between S and T
$\mathrm{cut}(S)$    the size of the cut around vertex set S
$\mathrm{vol}(S)$    the sum of degrees (volume) of vertices in S
$\mathrm{edges}(S)$  twice the number of edges among vertices in S
$\phi(S)$            the conductance of vertex set S

Summary of contributions. 

• We theoretically motivate the study of neighborhood 
communities by showing they often have a low conduc- 
tance in graphs with a power-law degree distribution 
and large clustering coefficients. 

• We empirically evaluate these neighborhood commu- 
nities and find them comparable to those communities 
found by other algorithms at small size scales. 

• We find a small set of neighborhood communities that 
can be grown into larger communities using a PageR- 
ank based community detection algorithm. The results 
match those communities found with a more expensive 
sweep over all communities. 

2. FORMAL SETTING AND NOTATION 

We first list out the various notations and formalisms used. 
All of the key notation is summarized in Table 1, and we 
briefly review it here. Let $G = (V, E)$ be a loop-less, undirected,
unweighted graph. We denote the number of vertices
by $n = |V|$ and the number of edges by $m = |E|$. In terms
of the adjacency matrix, $m$ is half the number of non-zero
entries. For a vertex $v$, let $d_v$ be the degree of $v$. For any
positive integer $d$, let $f_d$ be the number of vertices of degree
$d$, that is, the frequency of $d$ in the degree distribution.
The maximum degree is denoted by $d_{\max}$. Let $D_r(v)$ be
the distance-$r$ neighborhood of $v$. This is the set of vertices
whose shortest path distance from $v$ is exactly $r$. Then, we
define the ball of distance $r$ around $v$, denoted by $N_r(v)$, as
the set $\bigcup_{i \le r} D_i(v)$.

2.1 Clustering coefficients 

A wedge is an unordered pair of edges that share an endpoint.
The center of the wedge is the common vertex between
the edges. A wedge $\{(s,t), (s,u)\}$ is closed if the edge
$(t, u)$ exists, and is open otherwise. We use $W$ to denote the
set of wedges in $G$, and $W_v$ for the set of wedges centered
at $v$. Note that $|W_v| = \binom{d_v}{2}$. We set $p_v = |W_v|/|W|$.

Social networks often have large clustering coefficients [39].
Because of the varying definitions of this term that are used,
we will denote by $\kappa$ the global clustering coefficient. This
quantity is basically a normalized count of triangles. In the
following, we think of $w$ drawn uniformly at random from
$W$.



$$\kappa = \Pr[w \text{ is closed}] = \frac{\text{number of closed wedges}}{|W|}$$

In terms of triangles, $\kappa = 3 \cdot \text{number of triangles}/|W|$. For
any vertex $v$, $C_v$ is the local clustering coefficient of $v$. We
draw $w$ uniformly at random from $W_v$:

$$C_v = \Pr[w \text{ is closed}] = \frac{\text{number of closed wedges in } W_v}{|W_v|}$$
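As an illustration of these two definitions, the following Python sketch computes the global and local clustering coefficients by enumerating wedges directly. The function name `clustering_coefficients` and the adjacency-dictionary representation are assumptions made for this example and are not part of the released codes; the brute-force enumeration is only meant for small graphs.

```python
from itertools import combinations


def clustering_coefficients(adj):
    """Return (kappa, local) for an undirected simple graph.

    adj maps each vertex to the set of its neighbors (hypothetical format).
    kappa is the fraction of all wedges that are closed; local maps each
    vertex with at least two neighbors to its local clustering coefficient.
    """
    total_wedges = 0
    total_closed = 0
    local = {}
    for v, nbrs in adj.items():
        wedges = len(nbrs) * (len(nbrs) - 1) // 2  # |W_v| = C(d_v, 2)
        if wedges == 0:
            continue  # no wedge is centered at v
        # A wedge {(v, s), (v, t)} is closed when the edge (s, t) exists.
        closed = sum(1 for s, t in combinations(nbrs, 2) if t in adj[s])
        local[v] = closed / wedges
        total_wedges += wedges
        total_closed += closed
    kappa = total_closed / total_wedges if total_wedges else 0.0
    return kappa, local


# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
kappa, local = clustering_coefficients(adj)
# Five wedges, three closed: kappa = 3/5, matching 3 * (1 triangle) / 5.
```

In this small graph the pendant vertex lowers the local coefficient of vertex 2 to 1/3 while vertices 0 and 1 keep a coefficient of 1.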



2.2 Cuts and Conductance 

Given a set of vertices $S$, the set $\bar{S}$ is the complement set,
$\bar{S} = V \setminus S$. For disjoint sets of vertices $S, T$, $E(S, T)$ denotes
the edges between $S$ and $T$. For convenience, we denote the
size of the cut induced by a set, $|E(S, \bar{S})|$, by $\mathrm{cut}(S)$.

The conductance of a cluster (a set of vertices) measures 
the probability that a one-step random walk starting in that 
cluster leaves that cluster. Let $\mathrm{vol}(S)$ denote the sum of
degrees of vertices in $S$ and $\mathrm{edges}(S)$ denote twice the number
of edges among vertices in $S$, so that

$$\mathrm{edges}(S) = \mathrm{vol}(S) - \mathrm{cut}(S).$$

Then the conductance of set $S$, denoted $\phi(S)$, is

$$\phi(S) = \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))}.$$

Conductance is measured with respect to whichever of $S$ or $\bar{S}$ has
smaller volume, and is the probability of picking an edge
from the smaller set that crosses the cut. Because of this
property, conductance is preserved on taking complements:
$\phi(S) = \phi(\bar{S})$. For this reason, when we refer to the number
of vertices in a set of conductance $\phi$, we always use the
smaller set size $\min(|S|, |\bar{S}|)$. Figure 1 shows a few communities
and their associated cuts and conductance scores from our 
methods and two points of comparison. 
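The definitions of cut, volume, and conductance translate directly into code. The sketch below is a minimal illustration, assuming the graph is stored as a dictionary from each vertex to its neighbor set; the function name `cut_vol_conductance` is hypothetical.

```python
def cut_vol_conductance(adj, S):
    """Return (cut(S), vol(S), phi(S)) for a vertex set S.

    adj maps each vertex to the set of its neighbors (hypothetical format).
    """
    S = set(S)
    vol_S = sum(len(adj[v]) for v in S)                 # sum of degrees in S
    vol_total = sum(len(nbrs) for nbrs in adj.values())
    cut = sum(1 for v in S for u in adj[v] if u not in S)
    denom = min(vol_S, vol_total - vol_S)               # smaller side of the cut
    phi = cut / denom if denom > 0 else float("nan")
    return cut, vol_S, phi


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
cut, vol, phi = cut_vol_conductance(adj, {0, 1, 2})
# cut = 1, vol = 7, phi = 1/7; here vol - cut = 6 is twice the three
# edges inside the triangle, i.e. edges(S).
```

Calling the function on the complement `{3, 4, 5}` returns the same conductance, which illustrates that conductance is preserved on taking complements.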

2.3 Finding good conductance communities 

We briefly review three ways of identifying a community 
with a good conductance score. 

Fiedler set. 

The well-known Cheeger inequality defines a bound be- 
tween the second smallest eigenvalue of the normalized Lapla- 
cian matrix and the set of smallest conductance in a graph [11]. 
Formally, 

$$\tfrac{1}{2}\lambda_2 \le \min_{S} \phi(S) \le \sqrt{2\lambda_2}$$

where $\lambda_2$ is the second smallest eigenvalue of the normalized
Laplacian. The proof is constructive. It identifies a set of
vertices that obeys the upper bound using a sweep cut. This
is the smallest conductance cut among all cuts induced by
ordering vertices by increasing values of $\sqrt{d_v}\,x_v$, where $x_v$ is
the component of the eigenvector associated with $\lambda_2$. This
is the same idea used in normalized cut procedures [36]. We



refer to the set identified by this procedure as the Cheeger 
community or Fiedler community. The latter term is based 
on Fiedler's work in using the second smallest eigenvalue of 
the combinatorial Laplacian matrix [14]. Figure 1b shows
the Fiedler community for the Les Miserables network. 

Personalized PageRank communities. 

Another highly successful scheme for community detection 
based on conductance uses personalized PageRank vectors. 
A personalized PageRank vector is the stationary distribution
of a random walk that follows an edge of the graph
with probability $\alpha$ and "teleports" back to a fixed seed vertex
with probability $1 - \alpha$. We use $\alpha = 0.99$ in all experiments.
The essence of the induced community is that an
inexact personalized PageRank vector, computed via an algorithm
that "pushes" rank around the graph, will identify
good bottlenecks near a seed vertex. These bottlenecks
can be formalized in a Cheeger-like bound [2]. The procedure
to find a personalized PageRank community is: i)
specify a value of $\alpha$, a seed vertex $v$, and a desired cluster
size $\sigma$; ii) solve the personalized PageRank problem using
the algorithm from [2] to a degree-weighted tolerance of
$\tau = 1/(10\sigma)$; and iii) sweep over all cuts induced by the ordering
of the personalized PageRank vector (normalized by
degrees) and choose the best. Personalized PageRank com- 
munities (PPR communities, for short) were used to identify 
an interesting empirical property of communities in large 
networks [24,25]. To generate these plots, those authors examined
a range of values of $\sigma$ for a large number of vertices
of the graph and summarized the best communities found
at any size scale in a network community plot. Figure 1d
shows the best personalized PageRank community for the 
network of character interactions in Les Miserables. 
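The three-step procedure above can be sketched in a few lines of Python. This is a simplified, non-lazy push variant written for illustration; it is not claimed to match the algorithm of [2] in every detail. The function names `ppr_push` and `sweep_cut`, the adjacency-dictionary format, and the parameter name `tol` (playing the role of the degree-weighted tolerance) are assumptions of this example.

```python
from collections import deque


def ppr_push(adj, seed, alpha=0.99, tol=1e-4):
    """Approximate personalized PageRank by pushing residual mass.

    The walk follows an edge with probability alpha and teleports to the
    seed with probability 1 - alpha. Pushing stops once every residual
    r[u] is below tol * degree(u). Assumes no self-loops and positive degrees.
    """
    x = {}               # approximate PageRank values
    r = {seed: 1.0}      # residual mass still to be distributed
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        d_u = len(adj[u])
        r_u = r.get(u, 0.0)
        if r_u < tol * d_u:
            continue     # guard against stale queue entries
        x[u] = x.get(u, 0.0) + (1.0 - alpha) * r_u
        r[u] = 0.0
        for w in adj[u]:
            before = r.get(w, 0.0)
            r[w] = before + alpha * r_u / d_u
            threshold = tol * len(adj[w])
            if before < threshold <= r[w]:
                queue.append(w)  # w just crossed its threshold
    return x


def sweep_cut(adj, x):
    """Return (best prefix set, its conductance) over the x[u]/d_u ordering."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    order = sorted(x, key=lambda u: x[u] / len(adj[u]), reverse=True)
    inside, vol, cut = set(), 0, 0
    best_phi, best_size = float("inf"), 0
    for size, u in enumerate(order, start=1):
        links_in = sum(1 for w in adj[u] if w in inside)
        inside.add(u)
        vol += len(adj[u])
        cut += len(adj[u]) - 2 * links_in  # new cut edges minus absorbed ones
        denom = min(vol, total_vol - vol)
        if denom == 0:
            continue                       # prefix covers the whole graph
        if cut / denom < best_phi:
            best_phi, best_size = cut / denom, size
    return set(order[:best_size]), best_phi


# Two triangles joined by the edge (2, 3); seed the walk in the left triangle.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community, phi = sweep_cut(adj, ppr_push(adj, seed=0))
# The sweep recovers the seed's triangle {0, 1, 2} with conductance 1/7.
```

The example uses a fixed small tolerance rather than deriving it from a target cluster size; under the procedure described above one would instead pass `tol = 1 / (10 * sigma)`.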

Whisker communities. 

Perhaps the best point of comparison with our approach
is the set of whisker communities defined by Leskovec et al. [24,
25]. These communities are small dense subgraphs con- 
nected by a single edge. They can be found by looking at 
any subgraph connected to the largest biconnected compo- 
nent by a single edge. A biconnected component remains 
connected after the removal of any vertex. Note that the 
largest biconnected component is not necessarily a 2-core of 
the graph. Leskovec et al. observed that many of these sub- 
graphs are rather dense. Each subgraph has a cut of exactly 
one, and consequently, a productive means of finding sets 
with low conductance is to sort these subgraphs by their 
volume. The best whisker cut is the single subgraph with 
largest volume. 

3. RELATED WORK 

We are hardly the first to notice that vertex neighbor- 
hoods have special properties. 

Egonets, homophily, and structural holes. 

In the context of social networks, vertex neighborhoods
are often called egonets because they reflect the state of
the network as perceived by a single vertex. Their analysis
is a key component in the study of social networks [38], especially
in terms of data collection. Studies of these networks
often focus on the theory of structural holes, which is the
notion that an individual can derive an advantage from serving
as a bridge between disparate groups [10]. These bridge
roles are interesting because they contradict homophily in
social ties. Homophily, or the principle that similar individuals
form ties, is the mechanism that is expected to produce
networks with large local clustering coefficients [28].
These social theories have prompted the development of new
methods to tease apart some of these effects in real-world
networks [22], and to develop network models that capture
structural holes [19].

Figure 1: A series of vertex sets and their associated sizes and conductance scores on the graph of characters
from Les Miserables [20]. The best neighborhood and best k-core are two of the communities we discuss
further in §6. See §2.3 for information on the Fiedler and PPR communities.

(a) Best neighborhood: size=8, cut=10, $\phi$=0.15
(b) Fiedler community: size=36, cut=29, $\phi$=0.13
(c) Best k-core: size=12, cut=34, $\phi$=0.22
(d) Best PPR community: size=28, cut=31, $\phi$=0.12

Clustering and communities. 

Vertex neighborhoods often play a role in other techniques 
to find community or clustering structure in a network. Over- 
lap in the neighborhood sets of vertices is a common ver- 
tex similarity metric used to guide graph clustering algo- 
rithms [34]. Other schemes utilize vertex neighborhoods 
as good seed sets for local techniques to grow communi- 
ties [16,33]. We explore using a carefully chosen set of 
neighborhoods for this purpose in our final empirical dis- 
cussion (§7). Perhaps the most closely related work is a 
recent idea to utilize the connected components of ego-nets, 
after their ego vertex is removed, to produce a good set of 
overlapping communities [32]. Our theoretical results estab- 
lish that these ideas are highly likely to succeed in networks 
with local clustering and power-law degree distributions. 

Graph properties. 

Much of the modern work on networks rests on surpris- 
ing empirical observations about the structure of real world 
connections. For instance, information networks were found 
to have a power-law in the degree distribution [5, 13]. These 
same networks were also found to have considerable local 
structure in the form of large clustering coefficients [39], but 
retained a small global diameter. Our theory shows that a 
third potential observation - the existence of vertex neighborhoods
with low conductance - is in fact implied by these
other two properties. We formally show that heavy tailed 
degree distributions and high clustering coefficients imply 
the existence of large dense cores. 

Anomaly detection.



Predictable behavior in the structure of ego-nets makes 
them a useful tool for detecting anomalous patterns in the 
structure of the network. For instance, Akoglu et al. [1] com- 
pute a small collection of measures on each egonet, such as 
the average degree and largest eigenvalues. Outliers in this 
space of vertices are often rather anomalous vertices. Our 
work is, in contrast, a precise statement about the regular- 
ity of the ego-nets, and says that we always expect a large 
ego-net to be a good community. 

Summary. 

Although we are not the first to study neighborhood based 
communities, the relationship between the local clustering, 
power-law degree distributions, and large neighborhoods with 
small conductance does not appear to have been noticed be- 
fore. 

4. THEORETICAL JUSTIFICATION FOR 
NEIGHBORHOOD COMMUNITIES 

The aim of this section is to provide some mathematical 
justification for the success of neighborhood cuts. Our aim 
is to show that heavy tailed degree distributions and large 
clustering coefficients imply the existence of neighborhood 
cuts with low conductance and large dense cores. As men- 
tioned earlier, the exact bounds we get are somewhat weak 
and only hold when the clustering coefficient is extremely 
large. Nonetheless, the proofs give significant intuition into 
why neighborhoods are good communities. 

We begin with the extreme case when the value of $\kappa$ is
1 (so every wedge is closed). Then we have the following
simple claim. 

Claim 4.1. Suppose the global clustering coefficient of G 
is 1. Then G is the union of disjoint cliques. 

Proof. Consider two vertices $u$ and $v$ that are connected.
Suppose the shortest path distance between them is $\ell > 1$.
Then the shortest path has at least 3 distinct vertices
(including $u$ and $v$). Take the last three vertices on this
path, $v_1, v_2, v$. This forms a wedge at $v_2$, and must be closed
(since the clustering coefficient is 1). Hence, the edge $(v_1, v)$
exists and there exists a path between $u$ and $v$ of length less
than $\ell$. This is a contradiction.

Hence, any two connected vertices have a shortest path 
distance of 1, i.e., are connected by an edge. The graph is a 
disjoint union of cliques. □ 

Note that the neighborhood of any vertex in the above 
claim forms a clique disconnected from the rest of G. There- 
fore, all neighborhoods form perfect communities, in this 
extremely degenerate case. We prove this for more general
settings. The quantities $p_v = |W_v|/|W|$ form a distribution
over the set of vertices. Since we are performing an
asymptotic analysis, we will use $o(1)$ to denote any quantity
that becomes negligible as the graph size increases. We will
choose $\beta$ to be a constant less than 1. It is quite unimportant
for the asymptotic analysis what this constant is. From a
practical standpoint, think of $\beta$ as a constant such that most
edges are incident to a vertex of degree at least $d_{\max}^{\beta}$ (2/3
is usually a reasonable value). Also, we will assume that
the power law exponent is at most 3, a fairly acceptable 
condition. 

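Claim 4.2. Let $S$ be the set of vertices with degrees more
than $d_{\max}^{\beta}$. Then, $\sum_{v \in S} p_v = 1 - o(1)$.

Proof. We can set $p_v = (2|W_v|)/(2|W|)$. For convenience,
set $d_1 = d_{\max}^{\beta}$ and $d_2 = d_{\max}$. We have $f_d \approx \alpha n/d^{\gamma}$,
for some constant $\alpha$ and $\gamma < 3$.

$$\sum_{v \in S} 2|W_v| \approx \sum_{d=d_1}^{d_2} d^2 f_d \approx \alpha n \sum_{d=d_1}^{d_2} d^{2-\gamma} \approx \alpha' n \left(d_2^{3-\gamma} - d_1^{3-\gamma}\right)$$

The total number of wedges behaves like $\alpha' n d_2^{3-\gamma}$ and hence,
$2\sum_{v \in S} |W_v| = 2|W| - o(|W|)$. □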



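Claim 4.3. $\sum_v p_v C_v = \kappa$.

Proof.

$$\sum_v p_v C_v = \sum_v \frac{|W_v|}{|W|} \cdot \frac{\text{number of closed wedges in } W_v}{|W_v|} = \sum_v \frac{\#\text{ closed wedges in } W_v}{|W|} = \kappa.$$

□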



We come to our important lemma. This argues that on the 
average, neighborhood cuts must have a low conductance. 



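Lemma 4.4.

$$\sum_v p_v \frac{\mathrm{cut}(N_1(v))}{|W_v|} = 2(1 - \kappa)$$

Proof. We express the sum of $\mathrm{cut}(N_1(v))$ as a double
summation, and perform some algebraic manipulations.

$$\begin{aligned}
\sum_v \mathrm{cut}(N_1(v)) &= \sum_v \sum_{u \in N_1(v)} |N_1(u) \setminus (N_1(v) \cup \{v\})| \\
&= \sum_u \sum_{v \in N_1(u)} |N_1(u) \setminus (N_1(v) \cup \{v\})| \\
&= \sum_u \sum_{v \in N_1(u)} (\#\text{ open wedges centered at } u \text{ involving edge } (u, v)) \\
&= \sum_u 2\,(\#\text{ open wedges centered at } u) \\
&= 2(1 - \kappa)|W|
\end{aligned}$$

We complete the proof with the following simple observation:

$$\sum_v p_v \frac{\mathrm{cut}(N_1(v))}{|W_v|} = \frac{\sum_v \mathrm{cut}(N_1(v))}{|W|}.$$

□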
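The counting identity behind this lemma, namely that the neighborhood cuts sum to twice the number of open wedges, can be checked numerically on a small graph. The Python sketch below is only a sanity check under an assumed adjacency-dictionary representation; the helper names `neighborhood_cut` and `wedge_counts` are hypothetical.

```python
def neighborhood_cut(adj, v):
    """Number of edges leaving the ball N_1(v) = {v} plus the neighbors of v."""
    ball = adj[v] | {v}
    return sum(1 for u in ball for w in adj[u] if w not in ball)


def wedge_counts(adj):
    """Return (total wedges, closed wedges) over all wedge centers."""
    total = closed = 0
    for nbrs in adj.values():
        ordered = sorted(nbrs)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                total += 1
                if ordered[j] in adj[ordered[i]]:
                    closed += 1
    return total, closed


# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cut_sum = sum(neighborhood_cut(adj, v) for v in adj)
total, closed = wedge_counts(adj)
# Each open wedge is counted once from each of its two edges, so
# cut_sum equals 2 * (total - closed), i.e. 2 * (1 - kappa) * |W|.
```

Here the four neighborhood cuts are 1, 1, 0, and 2, and the graph has two open wedges (both centered at vertex 2), so both sides equal 4.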



Theorem 4.5. There exists a k-core in $G$ for $k \ge \kappa d_{\max}^{\beta}/2$.

Proof. By Claims 4.2 and 4.3,

$$\kappa = \sum_v p_v C_v = \sum_{v \in S} p_v C_v + \sum_{v \notin S} p_v C_v \le \sum_{v \in S} p_v C_v + o(1)$$

This implies that there exists some vertex $v$ such that
$d_v > d_{\max}^{\beta}$ and $C_v \ge \kappa - o(1)$ (for convenience, we are going to
drop the $o(1)$ lower order term). Consider $G'$, the induced
subgraph of $G$ on $N_1(v)$. The total number of vertices is
exactly $d_v + 1$. Because a $\kappa$-fraction of the wedges centered
at $v$ are closed, the number of edges in $G'$ is at least $\kappa\binom{d_v}{2}$.
So $G'$ is a dense graph, and we will show that it contains a
large core. Perform a core decomposition on $G'$. We iteratively
remove the vertex of min-degree until the graph has
no edges left. The total number of iterations is at most $d_v$.
Let the degree of the removed vertex at iteration $i$ be $e_i$. We
have $\sum_{1 \le i \le d_v} e_i \ge \kappa\binom{d_v}{2}$. By an averaging argument, there
exists some $i$ such that $e_i \ge \kappa(d_v - 1)/2$. At this point, all
(unremoved) vertices of $G'$ must have a degree of at least
$\kappa(d_v - 1)/2$, forming a k-core with $k \ge \kappa d_{\max}^{\beta}/2$. □
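The core decomposition used in this proof, which repeatedly removes a vertex of minimum degree, is simple to state in code. The following Python sketch records the largest minimum degree seen during the peeling and the vertex set that remained at that moment. The function name `largest_core` and the adjacency-dictionary format are assumptions of this example, and the quadratic-time loop is chosen for clarity rather than speed.

```python
def largest_core(adj):
    """Return (k, vertices) for the deepest core found by min-degree peeling.

    adj maps each vertex to the set of its neighbors (hypothetical format).
    When the minimum degree among the remaining vertices is k, every
    remaining vertex has degree at least k, so those vertices form a k-core.
    """
    live = {v: set(nbrs) for v, nbrs in adj.items()}  # working copy
    best_k, best_core = 0, set(live)
    while live:
        v = min(live, key=lambda u: len(live[u]))     # a min-degree vertex
        k = len(live[v])
        if k > best_k:
            best_k, best_core = k, set(live)
        for w in live.pop(v):                         # remove v from the graph
            live[w].discard(v)
    return best_k, best_core


# A 4-clique {0, 1, 2, 3} with a path 3 - 4 - 5 hanging off vertex 3.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
       3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
k, core = largest_core(adj)
# Peeling removes 5 and then 4; the remaining clique is a 3-core.
```

Applied to the induced subgraph on a single neighborhood, this is the peeling step of the argument above; applied to a whole graph, it finds the largest k for which a k-core exists.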

We come to our main theorem that proves the existence 
of a neighborhood cut with low conductance. When $\kappa = 1$,
we get back the statement of Claim 4.1, since we have a set
of conductance 0. But this theorem also gives non-trivial
bounds for large values of $\kappa$. As we mentioned earlier, when
$\kappa$ becomes small, this bound is not useful any longer.

Theorem 4.6. There exists a neighborhood cut with conductance
at most $4(1 - \kappa)/(3 - 2\kappa)$.

Proof. The proof uses the probabilistic method, given
the bounds of Lemma 4.4 and Claim 4.3. Suppose we choose
a vertex $v$ according to the probability distribution given by
$p_v$. Let $X$ denote the random variable $\mathrm{cut}(N_1(v))/|W_v|$,
so $\mathrm{E}[X] = 2(1 - \kappa)$ (Lemma 4.4). By Markov's inequality,
$\Pr[X > 4(1 - \kappa)] < 1/2$.

Set $\alpha = 2\kappa - 1$, and set $\Pr[C_v < \alpha] = p$.

$$\kappa \le p\alpha + (1 - p) \implies p \le (1 - \kappa)/(1 - \alpha) = 1/2$$

By the union bound, the probability that
$\mathrm{cut}(N_1(v))/|W_v| > 4(1 - \kappa)$ or $C_v < \alpha$ is less than 1. Hence, there exists some
vertex $v$ such that $\mathrm{cut}(N_1(v)) \le 4(1 - \kappa)|W_v|$ and $C_v \ge \alpha$
(we can also show that $d_v > n^{\beta}$). Let $E$ be the set of edges in
the subgraph induced on $N_1(v)$. Since $C_v \ge \alpha$, $|E| \ge \alpha|W_v|$.
We can bound the conductance of $N_1(v)$,

$$\frac{\mathrm{cut}(N_1(v))}{|E| + \mathrm{cut}(N_1(v))} \le \frac{4(1 - \kappa)|W_v|}{\alpha|W_v| + 4(1 - \kappa)|W_v|} = \frac{4 - 4\kappa}{3 - 2\kappa}.$$

□
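To give a sense of scale, the bound $4(1-\kappa)/(3-2\kappa)$ can be evaluated at a few values of the global clustering coefficient. The values below are illustrative choices, except $\kappa = 0.56$, which is the largest global clustering coefficient reported in Table 2 (the arxiv network).

```latex
% Worked evaluation of the bound 4(1-\kappa)/(3-2\kappa).
\begin{align*}
\kappa = 1:    &\quad \frac{4(1-1)}{3-2} = 0, \\
\kappa = 0.9:  &\quad \frac{4(0.1)}{3-1.8} = \frac{0.4}{1.2} = \frac{1}{3}, \\
\kappa = 0.56: &\quad \frac{4(0.44)}{3-1.12} = \frac{1.76}{1.88} \approx 0.936, \\
\kappa = 0.5:  &\quad \frac{4(0.5)}{3-1} = 1.
\end{align*}
% The bound is below 1 exactly when 4(1-\kappa) < 3-2\kappa, i.e. \kappa > 1/2.
```

Since any conductance is at most 1, the bound carries information only when $\kappa > 1/2$, which is consistent with the remark that it is useful only for extremely large clustering coefficients.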



5. DATA 

Before we begin our empirical comparison, we first dis- 
cuss the data we use to compare and evaluate algorithms. 
These come from a variety of sources. See Table 2 for a 
summary of the networks and their basic statistics. All net- 
works are undirected and were symmetrized if the original 
data were directed. Also, any self-loops in the networks were
discarded. We only look at the largest connected component
of the network.

Table 2: Datasets for our experiments. The five
types are: collaboration networks, social networks,
technological networks, web graphs, and forest fire
models.

Graph             Verts     Edges      Avg. Deg.  Max Deg.  $\kappa$  $\bar{C}$
ca-AstroPh        17903     196972     22.0       504       0.318     0.633
email-Enron       33696     180811     10.7       1383      0.085     0.509
cond-mat-2005     36458     171735     9.4        278       0.243     0.657
arxiv             86376     517563     12.0       1253      0.560     0.678
dblp              226413    716460     6.3        238       0.383     0.635
hollywood-2009    1069126   56306653   105.3      11467     0.310     0.766
fb-Penn94         41536     1362220    65.6       4410      0.098     0.212
fb-A-oneyear      1138557   4404989    7.7        695       0.038     0.060
fb-A              3097165   23667394   15.3       4915      0.048     0.097
soc-LiveJournal1  4843953   42845684   17.7       20333     0.118     0.274
oregon2-010526    11461     32730      5.7        2432      0.037     0.352
p2p-Gnutella25    22663     54693      4.8        66        0.005     0.005
as-22july06       22963     48436      4.2        2390      0.011     0.230
itdk0304          190914    607610     6.4        1071      0.061     0.158
web-Google        855802    4291352    10.0       6332      0.055     0.519
ff-0.4            25000     56071      4.5        112       0.283     0.412
ff-0.49           25000     254180     20.3       1722      0.148     0.447

There are five types of networks:

Collaboration networks In these networks, the nodes
represent people. The edges represent collaborations, either
via a scientific publication (ca-AstroPh [23], cond-mat-2005 [31],
arxiv [9], dblp [7,8]), an email (email-Enron [25]),
or a movie (hollywood-2009 [7, 8]). These networks have
large mean clustering coefficients and large global clustering
coefficients.

Social networks The nodes are people again, and the
edges are either explicit "friend" relationships (fb-Penn94 [29],
fb-A [40], soc-LiveJournal1 [4]) or observed network activity
over edges in a one-year span (fb-A-oneyear [40]).

Technological networks The nodes act in a distributed 
communication network either as agents (p2p-Gnutella25 [27]) 
or as routers (oregon2 [23], as-22july06 [30], itdk0304 [37]). 
The edges are observed communications between the nodes. 

Web graphs The nodes are web-pages, and the edges are 
symmetrized links between the pages [25]. 

Forest fire models We also explore the forest fire graph 
model [23]. This model has large clustering coefficients and 
a highly skewed degree distribution. The model grows a 
network by adding a node at each step. On arrival, a new 
node picks a template uniformly at random from the existing 
nodes, and then the process "burns" around that node with 
a specified probability. Burned nodes are then connected 
to the new node. It has three parameters: the size of the 
initial clique k, the probability of following an edge in the 
burning process p, and the total number of nodes n. We 
specify $k = 2$ and $n = 25000$, and explore two choices for $p$:
short-burning $p = 0.4$ and long-burning $p = 0.49$.



6. EMPIRICAL NEIGHBORHOOD 
COMMUNITIES 

To compute the conductance scores for each neighborhood
in the graph, we can adapt any procedure that computes all local
clustering coefficients. Most of the work to compute a local
clustering coefficient is performed when finding the number
of triangles at the vertex. We can express the number of
triangles as $\mathrm{edges}(D_1(v))/2 = (\mathrm{edges}(N_1(v)) - 2d_v)/2$, that
is, half the number of edges between immediate neighbors of
$v$ (recall that we double-count edges). Then
$\mathrm{cut}(N_1(v)) = \mathrm{vol}(N_1(v)) - \mathrm{edges}(N_1(v))$. And so, given the number of triangles,
we can compute the cut assuming we can compute
the volume of the neighborhood. This is easy to do with any 
graph structure that explicitly stores the degrees. We also 
note that it's easy to modify Cohen's procedure for comput- 
ing triangles with MapReduce [12] to compute neighborhood 
conductance scores. Two extra steps are required: i) map 
each triangle back to its constituent nodes, then reduce to 
find the number of triangles at each node; and ii) map the 
joined edge and degree graph to both vertices in the edge, 
then sum the degrees of the neighborhood in the reduce. 
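The computation described above, which obtains each neighborhood's cut from its triangle count and the degrees of its members, can be sketched in Python as follows. This is an in-memory illustration rather than the MapReduce procedure; the function name `neighborhood_conductances` and the adjacency-dictionary format are assumptions of this example.

```python
def neighborhood_conductances(adj):
    """Map each vertex v to (size of N_1(v), conductance of N_1(v)).

    adj maps each vertex to the set of its neighbors (hypothetical format).
    Neighborhoods that cover the whole graph's volume are skipped.
    """
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    scores = {}
    for v, nbrs in adj.items():
        d_v = len(nbrs)
        # Triangles at v = edges among the neighbors of v; each such edge
        # is seen from both of its endpoints, hence the division by two.
        triangles = sum(len(adj[u] & nbrs) for u in nbrs) // 2
        vol = d_v + sum(len(adj[u]) for u in nbrs)  # vol(N_1(v))
        edges = 2 * triangles + 2 * d_v             # edges(N_1(v)), double-counted
        cut = vol - edges                           # cut(N_1(v))
        denom = min(vol, total_vol - vol)
        if denom > 0:
            scores[v] = (d_v + 1, cut / denom)
    return scores


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = neighborhood_conductances(adj)
# Vertex 0: neighborhood {0, 1, 2}, cut 1, volume 7, conductance 1/7.
# Vertex 2: neighborhood {0, 1, 2, 3}, cut 2, smaller volume 4, conductance 1/2.
```

Only per-vertex triangle counts and degrees are needed, which is why any routine that already computes local clustering coefficients can be extended to produce these scores.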

We use the network community plot from Leskovec et al. [24] 
to show the information on all of the neighborhood commu- 
nities. Given the conductance scores from all the neigh- 
borhood communities and their size in terms of number of 
vertices, we first identify the best community at each size. 
The network community plot shows the relationship between 
best community conductance and community size on a log-log
scale. Leskovec et al. found that these plots had
a characteristic shape for modern information networks: an 
initial sharp decrease until the community size reaches be- 
tween 100 and 1000, then a considerable rise in the conduc- 
tance scores for larger communities. In our case, neighbor- 
hood communities cannot be any larger than the maximum 
degree plus one, and so we mark this point on the graphs. 
We always look at the smaller side of the cut, so no commu- 
nity can be larger than half the vertices of the graph. We 
also mark this location on the plots. Each subsequent figure 
utilizes this size-vs-conductance plot. Note that we delib- 
erately attempt to preserve the axes limits across figures to 
promote comparisons. However, some of the figures do have 
different axis limits to emphasize the range of data. 
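The data behind such a size-vs-conductance plot is just the minimum conductance observed at each community size. A minimal Python sketch of that reduction follows; the function name `best_conductance_by_size` and the sample values are hypothetical and are not taken from our experiments.

```python
def best_conductance_by_size(communities):
    """Reduce (size, conductance) pairs to the best conductance per size.

    Returns a list of (size, best conductance) pairs sorted by size, which
    are the points one would draw on log-log axes.
    """
    best = {}
    for size, phi in communities:
        if size not in best or phi < best[size]:
            best[size] = phi
    return sorted(best.items())


# Made-up (size, conductance) pairs for illustration only.
points = best_conductance_by_size([(3, 0.5), (3, 0.2), (4, 0.7), (10, 0.05)])
# points == [(3, 0.2), (4, 0.7), (10, 0.05)]
```

The same reduction applies to any family of candidate sets, so neighborhood, whisker, and PageRank communities can be summarized on common axes.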

First, we show these network community plots, or per- 
haps better termed neighborhood community plots for our 
purposes, for six of the networks in Figure 2. These figures 
are representative of the best and worst of our results. As 
a reminder, we make all summary data and codes available 
online. Plots for other graphs are available on the website 
given in the introduction. 

The three graphs on the left show cases where a neighborhood
community is, or is close to, the best Fiedler community
(the red circle). The three graphs on the right highlight instances
where the Fiedler community is much better than
any neighborhood community. We find it mildly surprising
that these neighborhood communities can be as good as the
Fiedler community. The structure of the plot for both fb-A-oneyear
and soc-LiveJournal1 is instructive. Neighborhoods
of the highest degree vertices are not community-like - sug- 
gesting that these nodes are somehow exceptional. In fact, 
by inspection of these communities, many of them are nearly 
a star graph. However, a few of the large degree nodes define
strikingly good communities (these are sets with a few
hundred vertices with conductance scores of around 10~^).
This evidence concurs with the intuition from Theorem 4.6.

Figure 2: The best neighborhood community conductance at each
size (black) and the Fiedler community (red). (Note the axis
limits on ca-AstroPh.)

Note that all of these plots show the same shape Leskovec et al. 
observed. Consequently, in the next set of figures, and in 
the remainder of the empirical investigation, we compare 
our neighborhood communities against those computed via 
the personalized PageRank community scheme employed in 
that work and described in Section 2.3. 

Second, Figure 3 compares the neighborhood communi- 
ties to those computed by sweeping the local personalized 
PageRank algorithm over all of the vertices as described 
by Leskovec et al. [24]. We also show the behavior of the
whisker communities in this plot. The plot adopts
the same style as Figure 2. The PageRank communities are in
a deep blue color, and the whisker communities are shown
in a shade of green. Here, we see that the neighborhood
communities show similar behavior at small size scales (less 
than 20 vertices), but the personalized PageRank algorithm 
is able to find larger communities of smaller or similar
conductance. In these four cases (which are representative of all
of the remaining figures), one of the personalized PageRank 
communities was the Fiedler community. 

Figure 3: A comparison of neighborhood communities (black),
personalized PageRank communities (blue), and whiskers (green).

Based on this observation, we wanted to understand how
the best community identified by a range of algorithms compares
to the neighborhood communities. This is what our
third exploration does. The results are shown in Table 3.
We computed a set of communities with METIS by repeatedly
calling the algorithm, asking it to use more partitions
each time. See our online codes for the precise details of 
which partitions were used. 

By and large, the Fiedler cut, personalized PageRank, whiskers,
and METIS all tend to identify similar communities as the 
best. There are sometimes small differences. An example of 
a large difference is in the Penn94 graph, where the Fiedler 
community is much larger than the best PageRank commu- 
nity and it has better conductance. In this comparison, the 
neighborhood communities fare poorly. When they identify
a set whose conductance is as good as the rest, it is
always a whisker as well. In Section 7, we
explore using these neighborhood communities as seeds for
the PageRank algorithms. This will let us take advantage of
the observation that the neighborhood communities reflect
the shape of the network community plot with PageRank
communities.

6.1 Empirical Core Communities 

In our theoretical work, we found that large k-cores should
always exist in these networks. These should also look like
good communities and we briefly investigate this idea in
Figure 4. The standard procedure for computing k-cores is
to iteratively remove vertices in degree-sorted order using a
bucket sort [6]. We additionally store the step when each vertex
was removed from the graph. We sweep over all cuts induced
by this ordering, and for each k-core, store the best
conductance community. These are plotted in a line that
runs from core 1 to the largest core in the graph. The 1-core
is usually large and a bad community. Thus, the line
usually starts towards the upper-right of each network
community plot. Large cores are actually rather good communities.
Their conductance scores are noticeably higher than those of
the PageRank communities, but the network plots seem to
have similar shapes. We will exploit this property in the next
section.

Table 3: The single best community detected by any of the five methods explored.

Figure 4: Network community plots with neighborhood
communities (gray), PageRank communities (light blue),
whiskers (green) and k-cores (dark blue).
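
The peeling and sweep described above can be sketched as follows. This is a minimal illustration and not the code used for the experiments: it swaps the bucket sort for a heap with lazy deletion, which is simpler but not linear time, and the function names `core_order` and `sweep_conductance` are illustrative. It assumes an undirected graph stored as an adjacency dictionary with comparable vertex labels.

```python
import heapq


def core_order(adj):
    """Peel minimum-degree vertices; return (removal order, core numbers)."""
    deg = {v: len(adj[v]) for v in adj}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, order, core = set(), [], {}
    k = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry left over from a degree update
        k = max(k, d)  # core number never decreases along the peeling
        core[v] = k
        order.append(v)
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return order, core


def sweep_conductance(adj, order):
    """Conductance of each set of vertices that survives a prefix of removals."""
    total = sum(len(adj[v]) for v in adj)
    inside, vol, cut, out = set(), 0, 0, []
    for v in reversed(order):  # rebuild the graph from the densest core outward
        links_in = sum(1 for u in adj[v] if u in inside)
        inside.add(v)
        vol += len(adj[v])
        cut += len(adj[v]) - 2 * links_in
        denom = min(vol, total - vol)
        if denom > 0:
            out.append((len(inside), cut / denom))
    return out


# Toy graph: a 4-clique {0,1,2,3} with a pendant path 3-4-5.
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3],
       3: [0, 1, 2, 4], 4: [3, 5], 5: [4]}
order, core = core_order(adj)
best_size, best_phi = min(sweep_conductance(adj, order), key=lambda t: t[1])
```

On this toy graph the best sweep set is the 4-clique, which is also the 3-core.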

7. SEEDED COMMUNITIES 

Many of the theorems about extracting local communities 
from seed sets [2,3] require that the seed set itself be a good 
community. This is precisely what our theoretical results 
justify for neighborhood communities. Consequently, in this 
section, we look at growing the neighborhood communities 
using the local personalized PageRank community algorithm 
from a set of carefully chosen seeds. 
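
To make the growing step concrete, the following is a minimal sketch of a push-style approximate personalized PageRank followed by a sweep cut, in the style commonly used for local community detection. Whether it matches the procedure of Section 2.3 in every detail is an assumption. The parameter values `alpha` and `eps`, the uniform seed distribution, and the function names are choices of this sketch, and the volume-based growth limit used in our experiments is omitted.

```python
from collections import deque


def approximate_ppr(adj, seeds, alpha=0.15, eps=1e-4):
    """Push-style approximate personalized PageRank for a lazy random walk.

    adj maps each vertex to a list of neighbors (undirected, no isolated
    vertices). Seed mass is spread uniformly over distinct `seeds`
    (an assumption of this sketch).
    """
    p = {}
    r = {s: 1.0 / len(seeds) for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        du = len(adj[u])
        ru = r.get(u, 0.0)
        if ru < eps * du:
            continue  # residual is already below the push threshold
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2
        share = (1 - alpha) * ru / (2 * du)
        for w in adj[u]:
            r[w] = r.get(w, 0.0) + share
            if r[w] >= eps * len(adj[w]):
                queue.append(w)
        if r[u] >= eps * du:
            queue.append(u)
    return p


def sweep_cut(adj, p):
    """Best-conductance prefix when vertices are sorted by p[v] / degree(v)."""
    total = sum(len(ns) for ns in adj.values())
    order = sorted(p, key=lambda v: p[v] / len(adj[v]), reverse=True)
    inside, vol, cut = set(), 0, 0
    best_set, best_phi = None, float("inf")
    for v in order:
        links_in = sum(1 for u in adj[v] if u in inside)
        inside.add(v)
        vol += len(adj[v])
        cut += len(adj[v]) - 2 * links_in
        denom = min(vol, total - vol)
        if denom > 0 and cut / denom < best_phi:
            best_set, best_phi = set(inside), cut / denom
    return best_set, best_phi


# Toy graph: two 4-cliques joined by the edge 3-4.
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
       4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6]}
seed_set = [0] + adj[0]  # the neighborhood of vertex 0
best_set, best_phi = sweep_cut(adj, approximate_ppr(adj, seed_set))
```

Seeding with the neighborhood of vertex 0 recovers the clique containing it, which has a single cut edge.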

One of the key problems with using the personalized PageR- 
ank community algorithms is that finding a good set of seeds 
is not easy. For example, [15] describes a way to do this us- 
ing the most popular videos on YouTube. Such a meaning- 
ful heuristic is not always available. We begin this section 
by empirically showing that there is an easy-to-identify set 
of neighborhood communities that are local extrema in the 
network community plot of the neighborhood communities.

Figure 5: The conductance of locally minimal communities
in the itdk0304 graph (red). Note that
these capture most of the local minima (downward
spikes) in the profile.

First, some quick terminology: we say a neighborhood
community is a local minimum, or locally minimal, if the
conductance of the neighborhood of a vertex is smaller than the
conductance of each of the adjacent neighborhood communities.
Formally,

$\phi(N_1(v)) < \phi(N_1(w))$
for all $w$ adjacent to $v$

is true for any locally minimal community. We find there
is only a small set of locally minimal communities with
more than 6 vertices. Shown in Figure 5 are the conductance
and sizes of the roughly 7000 communities identified
by this measure for the itdk0304 graph. Indeed, among all of
the graphs with at least 85,000 vertices, this heuristic picks
out about 3% of the vertices as local minima. In the worst
case, it picked out 100,000 seeds for soc-LiveJournal1.
Increasing the minimum size to 10 vertices reduces this down
to 50,000 seeds. We then use these locally minimal
neighborhoods as seed sets for the personalized PageRank
community detection procedure. Each locally minimal
neighborhood is grown by up to 50 times its volume by solving
for communities using various values of a up to 50. We also
explore growing the k-cores by up to 5 times their volume.
See Figure 6 for the locally minimal communities and the
best grown community from the Les Miserables graph.

Figure 6: (Left) The center vertices of the locally
minimal vertex neighborhoods in the Les Miserables graph
are marked in red. (Right) The best PageRank-grown
community from these vertices matches the
best from any seed. (Panel annotation: size=28, cut=31, $\phi$=0.12.)
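
The seed selection rule is simple to state in code. The following is a minimal sketch, assuming the neighborhood conductance of every vertex is already available in a dictionary `phi`; the function name `locally_minimal_seeds` and the parameter `min_vertices` are illustrative. The default of 7 corresponds to neighborhoods with more than 6 vertices.

```python
def locally_minimal_seeds(adj, phi, min_vertices=7):
    """Vertices whose neighborhood conductance beats every adjacent one.

    adj: dict mapping each vertex to a list of neighbors.
    phi: dict mapping each vertex to the conductance of its neighborhood.
    min_vertices: smallest neighborhood size (degree + 1) to keep.
    """
    seeds = []
    for v, nbrs in adj.items():
        if len(nbrs) + 1 < min_vertices:
            continue  # neighborhood too small to be an interesting seed
        if all(phi[v] < phi[w] for w in nbrs):
            seeds.append(v)
    return seeds


# Toy example on a path 0-1-2-3-4 with made-up conductance values.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
phi = {0: 0.5, 1: 0.2, 2: 0.4, 3: 0.1, 4: 0.3}
seeds = locally_minimal_seeds(adj, phi, min_vertices=3)
```

Note that the strict inequality means two adjacent vertices with identical neighborhood conductance are both rejected; a non-strict comparison would keep both.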

Figure 7 shows the results. In these figures, we leave the 
baseline neighborhood communities in for comparison. The 
key insight is that the dark black line closely tracks
the outline of the pure PageRank-based community profile.
That profile was computed by using every vertex in
the graph as a seed (although some vertices were skipped
after 10 other clusters had already visited that vertex). This
effect is most clearly illustrated by the email-Enron dataset. 
The dark black line identifies almost all of the local minima 
from the full PageRank sweep (there are a few it misses). A 
weakness of these minimal seeds for PageRank is that they 
may not capture the largest communities. However, the
k-core grown communities do seem to capture this region of
the profile (e.g., arxiv), although ca-AstroPh is an exception.

8. CONCLUDING DISCUSSIONS 

We recap. Community detection is the problem of find- 
ing cohesive collections of nodes in a network. We formalize 
this as finding vertex sets with small conductance. Mod- 
ern information networks have many distinctive properties, 
including a large clustering coefficient and a heavy-tailed de- 
gree distribution. We derive a set of theoretical results that 
show these properties imply that such networks will have 
vertex neighborhoods that are themselves sets of small con- 
ductance. Although our theoretical bounds are weak, they 
suggest the following experiment: measure the conductance 
of vertex neighborhoods. 

Algorithms to compute all such conductance scores are 
easy to implement by modifying a routine for computing lo- 
cal clustering coefficients. We evaluate these communities 
on a set of real-world networks. In summary, our results 
support the idea that there are many neighborhood communities
which are good communities in a conductance sense.
They may be smaller than desired, however.

Figure 7: Network community plots (panels: ca-AstroPh,
email-Enron, arxiv, fb-A) with neighborhood
communities (gray), PageRank communities
(light blue), whiskers (green), k-cores (purple),
locally minimal seed PageRank communities (black),
and k-core seeded PageRank communities (red).
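
To illustrate why a clustering coefficient routine suffices, note that the closed neighborhood of a vertex v with degree d and t edges among its neighbors contains d + t edges, so its cut is its volume minus 2(d + t). The following minimal sketch uses that identity; the function name `neighborhood_conductances` is illustrative, and the set-intersection triangle count is one simple choice rather than the routine in the online codes.

```python
def neighborhood_conductances(adj):
    """Conductance of the closed neighborhood of every vertex.

    adj: dict mapping each vertex to an iterable of neighbors
    (undirected, no self-loops). Vertices whose neighborhood or its
    complement has zero volume are omitted from the result.
    """
    nbr = {v: set(ns) for v, ns in adj.items()}
    total = sum(len(s) for s in nbr.values())  # twice the edge count
    phi = {}
    for v, ns in nbr.items():
        # Edges among the neighbors of v: the same count a local
        # clustering coefficient routine needs. Each edge is seen twice.
        tri = sum(len(nbr[u] & ns) for u in ns) // 2
        vol = len(ns) + sum(len(nbr[u]) for u in ns)
        cut = vol - 2 * (len(ns) + tri)
        denom = min(vol, total - vol)
        if denom > 0:
            phi[v] = cut / denom
    return phi


# Toy graph: triangle {0,1,2} with a tail 2-3-4.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
phi = neighborhood_conductances(adj)
```

For vertex 0 the neighborhood is the triangle, with volume 7 and one cut edge, so its conductance is 1/3 once the smaller complement volume of 3 is used as the denominator.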

We next investigate finding a set of locally minimal com- 
munities. These communities represent the best of the neigh- 
borhood. We find that these locally minimal communities, of 
which there are many fewer than vertices in the graph (usually
around 3%), capture the local minima in the network
community profile plot. More importantly, they can be en- 
larged using a local personalized PageRank community de- 
tection procedure. Afterwards, the profile of these "grown" 
neighborhoods is strikingly close to the profile of the PageR- 
ank communities when seeded with all vertices individually. 
While we do not discuss timing due to the variability in 
the quality of implementations, this latter procedure is much
faster in our experiments. 

These findings have implications for future studies in com- 
munity detection. One explanation for the results with the 
PageRank seeds is that vertex neighborhoods form the core 
of any good community in the network. We highlight this 
as a direction for future research into neighborhood commu- 
nities. 